The COVID-19 pandemic was one of the most impactful medical events in human history. As one of the most deadly and wide spread infections in recorded history, it has changed American society in ways we are still trying to understand. One of the most obvious impacts was the death toll with nearly 1.2m Americans succumbing to the virus. While masking and social distancing likely helped limited the spread of the infection, only a vaccine could be considered a reasonable treatment against both infection and death for those infected.
We chose to tackle the impact of vaccines on COVID-19 deaths by attempting to answer three key SMART questions:
These questions represent reasonable evaluations of the available data despite probable confounds that we can’t directly deal with in an Exploratory Data Analysis (EDA).
Our first step was to read in and format the data from our data sources. This process is straight forward in so far as our data sources are federal, public datasets published on respected websites. This data is also longitudinal administrative data making it relatively reliable and timely ensuring a higher quality data product.
The COVID Cases data was obtained from the [usafacts.org]https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ which is a not-for-profit organisation that aggregates data from public sources. The explanation of their methodology can be found from here: * https://usafacts.org/articles/detailed-methodology-covid-19-data/. Initial dataset contained 3194 rows and 1269 columns.
# Read in covid case data file
# Source https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
cases <- data.frame(read.csv("covid_confirmed_usafacts.csv"))
# Drop unused (redundant) columns
cases <- subset(cases, select = -c(1,2,4))
# Aggregate by state
cases <- aggregate(. ~ State, data = cases, FUN = sum)
# Transpose dataset so that rows represent days and columns represent states
cases <- pivot_longer(cases, cols = -State, names_to = "Date", values_to = "CovidCases")
# Convert char string dates to date type
cases$Date <- as.Date(cases$Date, format = "X%m.%d.%Y")
# Fill missing days, select Mondays, and calculate weekly changes
cases <- cases %>%
group_by(State) %>% #group by state
complete(Date = seq(min(Date), max(Date), by = "day")) %>% #pick min and max date in dataset and create a sequence by day.basically it is used for filling missing days
fill(CovidCases, .direction = "down") %>% #dates with missing covid cases are filled with previous day values until a non-na value is found
filter(wday(Date) == 2) %>% #filter all cases for mondays
mutate(Weekly_CovidCases = CovidCases - lag(CovidCases, 1)) %>% #present monday's data - last monday's data
drop_na() # Remove NAs
# Add one week lag to lineup with cases
cases$Date <- cases$Date %m+% weeks(1)
head(cases)
## # A tibble: 6 × 4
## # Groups: State [1]
## State Date CovidCases Weekly_CovidCases
## <chr> <date> <int> <int>
## 1 AK 2020-02-10 0 0
## 2 AK 2020-02-17 0 0
## 3 AK 2020-02-24 0 0
## 4 AK 2020-03-02 0 0
## 5 AK 2020-03-09 0 0
## 6 AK 2020-03-16 0 0
The Covid Data is initially aggregated from county wise to state wise, then transposed so that dates which were column headers initially were rows now. The initial data did not have covid cases count for all days, so we have generated list of dates considering the start date and end date in the list. Later, these days were mapped with the covid cases count present in the source and days that did not have case count in the source were marked with values of previous day. Once the daily covid count is obtained, weekly cases count in then calculated. However, we know that deaths typically occur after infection and the [CDC]https://www.cdc.gov/nchs/nvss/vsrr/covid_weekly/index.htm has provided guidence that death data is typically lagged 1-2 weeks behind. We therefore lagged our case data one week to align better with the death data. After all these transformations, the final data frame cases) had 9231 rows and 4 columns
The COVID Deaths data was obtained from the same website [usafacts.org]https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/. This dataset also contained 3194 rows and 1269 columns.
# Read in covid deaths data file
# Source https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
deaths <- data.frame(read.csv("covid_deaths_usafacts.csv"))
# Drop unused (redundant) columns
deaths <- subset(deaths, select = -c(1,2,4))
# Aggregate by state
deaths <- aggregate(. ~ State, data = deaths, FUN = sum)
# Transpose dataset so that rows represent days and columns represent states
deaths <- pivot_longer(deaths, cols = -State, names_to = "Date", values_to = "DeathCount")
# Convert char/string dates to date type
deaths$Date <- as.Date(deaths$Date, format = "X%m.%d.%Y")
# Fill missing days, select Mondays, and calculate weekly changes
deaths <- deaths %>%
group_by(State) %>%
complete(Date = seq(min(Date), max(Date), by = "day")) %>%
fill(DeathCount, .direction = "down") %>%
filter(wday(Date) == 2) %>% # Mondays
mutate(Weekly_DeathCount = DeathCount - lag(DeathCount, 1)) %>%
drop_na() # Remove NAs from first calculation
# Display example of finished data
head(deaths)
## # A tibble: 6 × 4
## # Groups: State [1]
## State Date DeathCount Weekly_DeathCount
## <chr> <date> <int> <int>
## 1 AK 2020-02-03 0 0
## 2 AK 2020-02-10 0 0
## 3 AK 2020-02-17 0 0
## 4 AK 2020-02-24 0 0
## 5 AK 2020-03-02 0 0
## 6 AK 2020-03-09 0 0
The cleanup for deaths data is also done in the same way as cases. After the cleanup, data frame deaths had 9231 rows and 4 columns
Vaccination data is obtained from the source [cdc.gov]https://www.cdc.gov/coronavirus/2019-ncov/vaccines/index.html. It contained 38489 rows and 109 columns.
# Read in population data file
vaccination <- data.frame(read.csv("Vaccination.csv"))
# Convert char string dates to date type
vaccination$Date <- mdy(vaccination$Date)
# Split data
vacc_all <- vaccination[, c("Date", "State", "Administered_Dose1_Pop_Pct")]
vacc_65 <- vaccination[, c("Date", "State", "Administered_Dose1_Recip_65PlusPop_Pct")]
# Fill missing days, select Mondays, and calculate weekly changes
vacc_all <- vacc_all %>%
group_by(State) %>%
complete(Date = seq(min(Date), max(Date), by = "day")) %>%
fill(Administered_Dose1_Pop_Pct, .direction = "down") %>%
filter(wday(Date) == 2) %>% # Mondays
mutate(Weekly_Vacc_all = Administered_Dose1_Pop_Pct - lag(Administered_Dose1_Pop_Pct, 2)) %>%
drop_na() # Remove NAs from first calculation
# Fill missing days, select Mondays, and calculate weekly changes
vacc_65 <- vacc_65 %>%
group_by(State) %>%
complete(Date = seq(min(Date), max(Date), by = "day")) %>%
fill(Administered_Dose1_Recip_65PlusPop_Pct, .direction = "down") %>%
filter(wday(Date) == 2) %>% # Mondays
mutate(Weekly_Vacc_65 = Administered_Dose1_Recip_65PlusPop_Pct - lag(Administered_Dose1_Recip_65PlusPop_Pct, 2)) %>%
drop_na() # Remove NAs from first calculation
#Merge back together
vaccination <- merge(vacc_all, vacc_65, by = c("Date", "State"), all = TRUE)
# Display example of finished data
vaccination$Date <- vaccination$Date %m+% weeks(2)
head(vaccination)
## Date State Administered_Dose1_Pop_Pct Weekly_Vacc_all
## 1 2021-01-11 AK 0 0
## 2 2021-01-11 AL 0 0
## 3 2021-01-11 AR 0 0
## 4 2021-01-11 AZ 0 0
## 5 2021-01-11 CA 0 0
## 6 2021-01-11 CO 0 0
## Administered_Dose1_Recip_65PlusPop_Pct Weekly_Vacc_65
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
For this dataset as well, we performed the same process of data cleaning as cases, deaths but separately for vaccinations of all age groups and 65plus separately and later merged the results to single dataframe vaccination that contained 6324 rows and 6 columns. As with cases, the [CDC]https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/how-they-work.html notes that vaccinations can take “a few weeks” to be effective. We therefore lagged the vaccination data two weeks to better align with deaths.
We also wanted to examine how state policy impacted covid deaths. To do so, we have obtained a dataset of state and county COVID-19 policies from [healthdata.gov]https://healthdata.gov/dataset/COVID-19-State-and-County-Policy-Orders/gyqz-9u7n/about_data. The data contained 4219 rows and 11 columns representing the start and stop days of policies meant to limit the spread of COVID-19.
# ## Prepare Policy Data
# policy <- read.csv("COVID-19_State_and_County_Policy_Orders_20240420.csv")
# # Filter out county data
# policy <- subset(policy, policy_level=="state")
# policy <- policy %>%
# select(date, state_id, policy_type, start_stop) %>%
# filter(policy_type %in% c("Mandate Face Mask Use By All Individuals In Public Spaces",
# "Mandate Face Mask Use By All Individuals In Public Facing Businesses",
# "Food and Drink")) %>%
# unique() # Remove duplicates
# # Split start and end dates
# start<- subset(policy, start_stop=="start")
# colnames(start)[which(names(start) == "date")] <- "start"
# stop<- subset(policy, start_stop=="stop")
# colnames(stop)[which(names(stop) == "date")] <- "stop"
# # Merge back together
# adj_policy <- merge(start, stop, by.x=c('state_id', 'policy_type'), by.y=c('state_id', 'policy_type'), all=T)
# # Drop unused columns
# adj_policy <- adj_policy[,c("state_id", "policy_type", "start", "stop")]
# # Backfill missing end dates
# adj_policy$stop[is.na(adj_policy$stop)] <- "2022/12/31"
# # Convert date from character to Date type
# adj_policy$start <- as.Date(adj_policy$start, format = "%Y/%m/%d")
# # Convert date from character to Date type
# adj_policy$stop <- as.Date(adj_policy$stop, format = "%Y/%m/%d")
# # Save out to check data
# write.csv(adj_policy, "Policy check.csv")
# NOTE: There were 5 cases where the start/stop dates were reversed in the data and 6 cases had the same start/stop dates. These were manually corrected and read back in.
adj_policy <- read.csv("Policy check.csv")
# Format dates
adj_policy$start<- mdy(adj_policy$start) #converting to proper date objects in R using mdy function
adj_policy$stop<- mdy(adj_policy$stop)
# Fill missing dates
for(i in 1:nrow(adj_policy)){ #looping the dataset from row 1 to last row
Date <- as.Date(seq(adj_policy[i, 3], adj_policy[i, 4], by = "days")) #generating dates from start date to end date for each row
policy_type <- rep(adj_policy[i, 2], length(Date)) #replicate policytype for each date generated above
State <- rep(adj_policy[i, 1], length(Date)) #replicate state value for each date generated above
tmp<- data.frame(cbind(policy_type, State)) #merge policytype and state columns and store to tmp dataframe
# note, this is done because cbind converts dates to numerics
tmp$Date <- Date #merge date columns to tmp df
if(exists('long_policy')){ #storing the generated dates into long-policy
long_policy <- rbind(long_policy, tmp)
} else {
long_policy <-tmp
}
}
# Create dummies for the policies
long_policy <- dummy_cols(long_policy, select_columns = "policy_type")
# Fix column names
colnames(long_policy)[which(names(long_policy) == "policy_type_Mandate Face Mask Use By All Individuals In Public Facing Businesses")] <- "Masks_Bus"
colnames(long_policy)[which(names(long_policy) == "policy_type_Mandate Face Mask Use By All Individuals In Public Spaces")] <- "Masks_Public"
colnames(long_policy)[which(names(long_policy) == "policy_type_Food and Drink")] <- "Food_Drink"
# Drop policy type column
long_policy<- long_policy[ , c("State", "Date", "Food_Drink", "Masks_Bus", "Masks_Public")]
# Display the long_policy
head(long_policy)
## State Date Food_Drink Masks_Bus Masks_Public
## 1 AK 2020-04-24 1 0 0
## 2 AK 2020-04-25 1 0 0
## 3 AK 2020-04-26 1 0 0
## 4 AK 2020-04-27 1 0 0
## 5 AK 2020-04-28 1 0 0
## 6 AK 2020-04-29 1 0 0
We first filtered the dataset to include only state level policies. Next, we picked policies that we felt were likely to be impactful and were more widely applied like “Mandate Face Mask Use By All Individuals In Public Facing Businesses”, “Mandate Face Mask Use By All Individuals In Public Spaces” and “policy_type_Food and Drink”. We have then created new columns called as Start and Stop that indicates policy implementation state. Basis this, dummies were created for each day if policy was present or not.
Having cleaned the individual data files to include date and state as a primary key, we then merged deaths, cases, vaccines, and policy into a single dataset on these keys. To account for policies not in place during the time when we had data for other variables, we also back filled 0s to represent no policy where NAs were induced due to the merge.
## Merge Datasets Together
# Merge cases, deaths, and vaccination data first
df <- merge(cases, deaths, by = c("Date", "State"), all = TRUE)
df <- merge(df, vaccination, by = c("Date", "State"), all.x = TRUE)
# Now merge the policy data
df <- merge(df, long_policy, by = c("Date", "State"), all = TRUE)
# Replace NAs with appropriate values, adjust according to data understanding
df[is.na(df)] <- 0 #basically where policy is not present, for those dates, 0 is filled
# Display example of finished data
head(df)
## Date State CovidCases Weekly_CovidCases DeathCount Weekly_DeathCount
## 1 2020-02-03 AK 0 0 0 0
## 2 2020-02-03 AL 0 0 0 0
## 3 2020-02-03 AR 0 0 0 0
## 4 2020-02-03 AZ 0 0 0 0
## 5 2020-02-03 CA 0 0 0 0
## 6 2020-02-03 CO 0 0 0 0
## Administered_Dose1_Pop_Pct Weekly_Vacc_all
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Administered_Dose1_Recip_65PlusPop_Pct Weekly_Vacc_65 Food_Drink Masks_Bus
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Masks_Public
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
We also believe that there is some regionality to our data. The literature indicates that policy diffuses locally faster than nationally and travel restrictions intended to limit the spread of the virus would indicate that regional deaths might be correlated. To check this, we included a variable representing regional groupings based on [US Census]https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf regions.
# Region region key
regions <- data.frame(read.csv("StatetoRegion.csv"))
# Merge region into master file
df <- merge(df, regions, by = "State", all.x = TRUE)
# Generate dummy variables for region
df <- dummy_cols(df, select_columns = "Region")
# Drop region column
df$Region <- NULL
# Drop false states
df <- df %>% drop_na(Region_MidWest)
# View the dataframe with dummy variables
head(df)
## State Date CovidCases Weekly_CovidCases DeathCount Weekly_DeathCount
## 1 AK 2021-01-23 0 0 0 0
## 2 AK 2020-10-22 0 0 0 0
## 3 AK 2021-11-27 0 0 0 0
## 4 AK 2022-10-05 0 0 0 0
## 5 AK 2021-03-17 0 0 0 0
## 6 AK 2020-04-29 0 0 0 0
## Administered_Dose1_Pop_Pct Weekly_Vacc_all
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Administered_Dose1_Recip_65PlusPop_Pct Weekly_Vacc_65 Food_Drink Masks_Bus
## 1 0 0 0 0
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 0 0 1 0
## Masks_Public Location Region_MidWest Region_NorthEast Region_South
## 1 1 Alaska 0 0 0
## 2 0 Alaska 0 0 0
## 3 0 Alaska 0 0 0
## 4 0 Alaska 0 0 0
## 5 0 Alaska 0 0 0
## 6 0 Alaska 0 0 0
## Region_West Region_NA
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
For midterm, we have done EDA region-wise based on SMART questions chosen then. For final, we have decided to analyze further at state-level and so thorough EDA analysis has been done for the same. We started our analysis with checking if there were any null values. Initially, we accidentally removed negative values thinking it was an error within the data. But later we realized that those negative values have to be considered and they are result of not lagging the vaccination information by 2 weeks.
summary(df)
## State Date CovidCases Weekly_CovidCases
## Length:87046 Min. :2020-02-03 Min. : 0 Min. :-49462
## Class :character 1st Qu.:2020-12-16 1st Qu.: 0 1st Qu.: 0
## Mode :character Median :2021-08-23 Median : 0 Median : 0
## Mean :2021-08-28 Mean : 164357 Mean : 2025
## 3rd Qu.:2022-05-07 3rd Qu.: 0 3rd Qu.: 0
## Max. :2023-07-24 Max. :11300080 Max. :825195
## DeathCount Weekly_DeathCount Administered_Dose1_Pop_Pct Weekly_Vacc_all
## Min. : 0 Min. :-3710 Min. : 0.0 Min. :-27.7
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0 Median : 0 Median : 0.0 Median : 0.0
## Mean : 2215 Mean : 22 Mean : 7.3 Mean : 0.2
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. :102356 Max. : 6650 Max. :95.0 Max. : 35.1
## Administered_Dose1_Recip_65PlusPop_Pct Weekly_Vacc_65 Food_Drink
## Min. : 0.0 Min. :-7.6 Min. :0.000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.000
## Median : 0.0 Median : 0.0 Median :0.000
## Mean : 9.9 Mean : 0.2 Mean :0.056
## 3rd Qu.: 0.0 3rd Qu.: 0.0 3rd Qu.:0.000
## Max. :100.0 Max. :81.7 Max. :1.000
## Masks_Bus Masks_Public Location Region_MidWest
## Min. :0.000 Min. :0.000 Length:87046 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.000 Class :character 1st Qu.:0.00
## Median :1.000 Median :0.000 Mode :character Median :0.00
## Mean :0.508 Mean :0.403 Mean :0.12
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.000 Max. :1.00
## Region_NorthEast Region_South Region_West Region_NA
## Min. :0.000 Min. :0.00 Min. :0.00 Min. :0
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0
## Median :0.000 Median :0.00 Median :0.00 Median :0
## Mean :0.211 Mean :0.36 Mean :0.31 Mean :0
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:0
## Max. :1.000 Max. :1.00 Max. :1.00 Max. :0
#checking if there are any null values
sum(is.na(df))
## [1] 0
We found no missing values in our dataset. This is by design, but checking to make sure none had been added is smart before we begin the EDA. Next we focused on outlier removal in the data for critical columns. We started with plotting COVID-19 cases per each state before outlier removal, performed outlier removal using 1.5 Inter Quartile Range which is commonly used in Statistics and updated the dataframe accordingly, plotted the COVID-19 cases per state after removing outliers.
# Boxplot of number of COVID-19 cases by state before removing outliers
ggplot(df, aes(x = State, y = Weekly_CovidCases)) +
geom_boxplot() +
labs(title = "Box Plot of COVID-19 Cases by State (Before removing outliers)", x = "State", y = "COVID-19 Cases") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#selecting the columns that are critical for analysis and then removing outlier for them
columns_to_clean <- c("CovidCases", "Weekly_CovidCases", "DeathCount", "Weekly_DeathCount",
"Administered_Dose1_Pop_Pct", "Weekly_Vacc_all", "Administered_Dose1_Recip_65PlusPop_Pct", "Weekly_Vacc_65")
#defining a function to recursively remove outliers for all required columns
remove_outliers <- function(data) {
#calculating interquartile range
q1 <- quantile(data, 0.25, na.rm = TRUE)
q3 <- quantile(data, 0.75, na.rm = TRUE)
iqr <- q3 - q1
#defining lower and upper bounds based on commonly used 1.5 IQR rule
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
# Removing outliers (i.e value that is outside lower and upper bound)
data_clean <- data[which(data >= lower_bound & data <= upper_bound)]
return(list(cleaned_data = data_clean, lower_bound = lower_bound, upper_bound = upper_bound))
}
# Applying the function and plotting
for (column in columns_to_clean) {
#calling the function defined above to remove outliers
results <- remove_outliers(df[[column]])
#updating dataframe after removing outliers and saving to df_clean
df_clean <- df %>%
filter(!!sym(column) >= results$lower_bound & !!sym(column) <= results$upper_bound)
# Plotting histogram for each column before outlier removal
p_before <- ggplot(df, aes_string(x = column)) +
geom_histogram(binwidth = (max(df[[column]], na.rm = TRUE) - min(df[[column]], na.rm = TRUE)) / 30, fill = "skyblue") +
ggtitle(paste("Before Outlier Removal -", column))
print(p_before)
# Plotting histogram for each column after outlier removal
p_after <- ggplot(df_clean, aes_string(x = column)) +
geom_histogram(binwidth = (max(df_clean[[column]], na.rm = TRUE) - min(df_clean[[column]], na.rm = TRUE)) / 30, fill = "skyblue") +
ggtitle(paste("After Outlier Removal -", column))
print(p_after)
}
df <- df_clean
# Boxplot for COVID-19 cases by state after removing outliers
ggplot(df, aes(x = State, y = Weekly_CovidCases)) +
geom_boxplot() +
labs(title = "Box Plot of COVID-19 Cases by State (after removing outliers)", x = "State", y = "COVID-19 Cases") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The next set of visualizations show our three largest variables, Deaths, Cases and Vaccines over time for each state. We tried coming up with a clever solution for displaying all the states in a single graph to better visualize each variables impact on each state. The faceted time series graphs below do a great job of displaying a huge amount of data in a single image. From the below graphs you can see a huge spikes in vaccines administered, cases and deaths in certain states.
# GGplots of Deaths, Cases and Vaccines
# COVID deaths over time by state
ggplot(df, aes(x = Date, y = Weekly_DeathCount)) +
geom_line() +
theme_minimal() +
labs(title = "COVID Deaths Over Time", x = "Date", y = "Deaths")
# COVID cases over time by state
ggplot(df, aes(x = Date, y = Weekly_CovidCases, color = State)) +
geom_line() +
theme_minimal() +
labs(title = "COVID Cases Over Time by State", x = "Date", y = "COVID Cases")
# Vaccination rates over time by state
ggplot(df, aes(x = Date, y = Weekly_Vacc_all, color = State)) +
geom_line() +
theme_minimal() +
labs(title = "Vaccination Rates Over Time by State",
x = "Date",
y = "Vaccination Rate (%)")
#creating new dataframe with a new column called as region for better plotting wrt region
new_df <- df %>%
mutate(Region = case_when(
Region_MidWest == 1 ~ "MidWest",
Region_NorthEast == 1 ~ "NorthEast",
Region_South == 1 ~ "South",
Region_West == 1 ~ "West",
TRUE ~ "Other" # Default case if none of the above is true
))
# Faceted time series plot for Covid Cases in different Regions
ggplot(new_df, aes(x=Date, y=Weekly_CovidCases, color=Region)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Cases by Region")
# Faceted time series plot for Covid Cases in different Regions
ggplot(new_df, aes(x=Date, y=Weekly_DeathCount, color=Region)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Deaths by Region")
# Faceted time series plot for Covid Cases in different Regions
ggplot(new_df, aes(x=Date, y=Weekly_Vacc_all, color=Region)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Vaccines by Region")
# Faceted time series plot for Covid Cases in different states
ggplot(df, aes(x=Date, y=Weekly_CovidCases, color=State)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Cases by State")
# Faceted time series plot for Covid Cases in different states
ggplot(df, aes(x=Date, y=Weekly_DeathCount, color=State)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Deaths by State")
# Faceted time series plot for Covid Cases in different states
ggplot(df, aes(x=Date, y=Weekly_Vacc_all, color=State)) +
geom_line() +
facet_wrap(~State) +
ggtitle("Daily Covid Vaccines by State")
Now, we wanted to check how vaccinations played role with respect to cases and deaths in each state, region as well. We tried to plot few visualizations and found that there was a large spike in almost every state when the vaccine initially rolled out, and that California, Texas and New York were already experiencing a huge number of reported cases as well.
#Creating a State Summary here for later creating a scatter plot basis this for state level
state_summary <- df %>%
group_by(State) %>%
summarize(
Total_Weekly_CovidCases = sum(Weekly_CovidCases, na.rm = TRUE), # Total weekly COVID cases
Total_Weekly_DeathCount = sum(Weekly_DeathCount, na.rm = TRUE), # Total weekly deaths
Avg_Weekly_VaccinationRate = mean(Weekly_Vacc_all, na.rm = TRUE), # Average weekly vaccination rate
Total_CovidCases = sum(CovidCases, na.rm = TRUE), # Total COVID cases
Total_DeathCount = sum(DeathCount, na.rm = TRUE), # Total deaths
Avg_VaccinationRate = mean(Administered_Dose1_Pop_Pct, na.rm = TRUE), # Average vaccination rate
Avg_VaccinationRate_65Plus = mean(Administered_Dose1_Recip_65PlusPop_Pct, na.rm = TRUE) # Average vaccination rate for 65+
)
# Scatter plot for COVID-19 Cases vs Vaccination Rate at state level
ggplot(state_summary, aes(x = Avg_Weekly_VaccinationRate, y = Total_Weekly_CovidCases)) +
geom_point(aes(color = State), size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Relationship Between Weekly Vaccination Rate and Weekly COVID-19 Cases by State",
x = "Average Weekly Vaccination Rate (%)",
y = "Total Weekly COVID-19 Cases",
color = "State") +
theme_minimal()
# Scatter plot for COVID-19 Death count vs Vaccination Rate at state level
ggplot(state_summary, aes(x = Avg_Weekly_VaccinationRate, y = Total_Weekly_DeathCount)) +
geom_point(aes(color = State), size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Relationship Between Weekly Vaccination Rate and Weekly Death by State",
x = "Average Weekly Vaccination Rate (%)",
y = "Total Weekly COVID-19 Cases",
color = "State") +
theme_minimal()
#Region summary for plotting at region level
region_summary <- new_df %>%
group_by(Region) %>%
summarize(
Total_Weekly_CovidCases = sum(Weekly_CovidCases, na.rm = TRUE), # Total weekly COVID cases
Total_Weekly_DeathCount = sum(Weekly_DeathCount, na.rm = TRUE), # Total weekly deaths
Avg_Weekly_VaccinationRate = mean(Weekly_Vacc_all, na.rm = TRUE), # Average weekly vaccination rate
Total_CovidCases = sum(CovidCases, na.rm = TRUE), # Total COVID cases
Total_DeathCount = sum(DeathCount, na.rm = TRUE), # Total deaths
Avg_VaccinationRate = mean(Administered_Dose1_Pop_Pct, na.rm = TRUE), # Average vaccination rate
Avg_VaccinationRate_65Plus = mean(Administered_Dose1_Recip_65PlusPop_Pct, na.rm = TRUE) # Average vaccination rate for 65+
)
#Scatter plot of COVID-19 Cases vs Vaccination Rate at Region level
ggplot(region_summary, aes(x = Avg_Weekly_VaccinationRate, y = Total_Weekly_CovidCases, color = Region)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Weekly Vaccination Rate and Weekly COVID-19 Cases by Region",
x = "Average Weekly Vaccination Rate (%)",
y = "Total Weekly COVID-19 Cases",
color = "Region") +
theme_minimal()
#Scatter plot of COVID-19 Death Count vs Vaccination Rate at Region level
ggplot(region_summary, aes(x = Avg_Weekly_VaccinationRate, y = Total_Weekly_DeathCount, color = Region)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Weekly Vaccination Rate and Weekly Death by Region",
x = "Average Weekly Vaccination Rate (%)",
y = "Total Weekly COVID-19 Cases",
color = "Region") +
theme_minimal()
Next, we tried to visualize if there is any correlation between COVID-19 Cases, Deaths and Vaccination Rates. We started with selecting the required columns and then found correlation matrix, visualized as a heatmap.
# selecting the required columns for checking correlation matrix.
cor_columns <- c("Weekly_CovidCases", "Weekly_DeathCount", "Weekly_Vacc_all", "CovidCases", "DeathCount", "Administered_Dose1_Pop_Pct")
cor_matrix <- cor(df[cor_columns], use = "complete.obs")
#Heatmap of the correlation matrix
ggplot(data = as.data.frame(as.table(cor_matrix)), aes(Var1, Var2, fill = Freq)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
theme_minimal() +
labs(title = "Correlation Matrix of COVID-19 Metrics (Weekly and Total)",
x = "Variables",
y = "Variables",
fill = "Correlation Coefficient") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank())
Later we were interested to check the frequency distribution of cases by state and mask policy and so we decided to view contigency table. Some key take aways we gleaned from the below data include; larger states like California (CA) and Texas (TX) haved notably higher case counts; several states, such as Arizona, Florida and Georgia, show significant case numbers with no reported cases under a mask mandate, suggesting either a strong resistance to mask mandates or that mandates were introduced too late or removed too early. Additionally, D, MO, MT, NE, OK, SC, SD, TN, and WY show COVID-19 cases only in the ‘No Mask’ category.
# Basic contingency table to see the distribution of cases by state and mask policy
mask_policy_table_basic <- table(df$State, df$Masks_Public)
#summary for viewing better
df_summary <- df %>%
group_by(State, Masks_Public) %>%
summarize(
Total_CovidCases = sum(CovidCases, na.rm = TRUE),
Total_Weekly_CovidCases = sum(Weekly_CovidCases, na.rm = TRUE),
Total_Weekly_DeathCount = sum(Weekly_DeathCount, na.rm = TRUE),
Avg_Weekly_VaccinationRate = mean(Weekly_Vacc_all, na.rm = TRUE),
Total_DeathCount = sum(DeathCount, na.rm = TRUE),
.groups = 'drop'
)
#visualizing contingency table
mask_policy_table_detailed <- xtabs(Total_CovidCases ~ State + Masks_Public, data = df_summary)
print(mask_policy_table_detailed)
## Masks_Public
## State 0 1
## AK 1.77e+07 9.18e+06
## AL 1.20e+08 6.84e+07
## AR 6.49e+07 3.56e+07
## AZ 1.94e+08 0.00e+00
## CA 9.07e+08 5.71e+08
## CO 1.36e+08 8.25e+07
## CT 8.02e+07 4.96e+07
## DC 1.47e+07 8.82e+06
## DE 2.66e+07 1.66e+07
## FL 5.96e+08 0.00e+00
## GA 1.60e+08 0.00e+00
## HI 2.83e+07 0.00e+00
## IA 7.38e+07 4.55e+07
## ID 3.47e+07 0.00e+00
## IL 3.03e+08 1.91e+08
## IN 9.79e+07 3.58e+07
## KS 7.71e+07 4.89e+07
## KY 1.25e+08 7.40e+07
## LA 6.28e+07 2.33e+07
## MA 1.65e+08 1.04e+08
## MD 1.04e+08 6.53e+07
## ME 2.40e+07 1.45e+07
## MI 2.37e+08 1.33e+08
## MN 1.31e+08 8.38e+07
## MO 1.18e+08 0.00e+00
## MS 2.99e+07 7.99e+06
## MT 2.18e+07 0.00e+00
## NC 2.77e+08 1.65e+08
## ND 2.33e+07 1.43e+07
## NE 4.73e+07 0.00e+00
## NH 3.00e+07 1.87e+07
## NJ 2.34e+08 1.44e+08
## NM 5.42e+07 3.37e+07
## NV 7.03e+07 4.38e+07
## NY 5.12e+08 3.12e+08
## OH 1.05e+08 3.27e+07
## OK 1.02e+08 0.00e+00
## OR 7.06e+07 4.25e+07
## PA 3.07e+08 2.01e+08
## RI 3.60e+07 2.26e+07
## SC 1.27e+08 0.00e+00
## SD 2.36e+07 0.00e+00
## TN 7.24e+07 0.00e+00
## TX 6.31e+08 3.80e+08
## UT 8.98e+07 5.56e+07
## VA 1.74e+08 1.05e+08
## VT 1.22e+07 7.62e+06
## WA 1.46e+08 8.77e+07
## WI 1.60e+08 9.80e+07
## WV 2.55e+07 6.13e+06
## WY 8.26e+06 0.00e+00
Finally, in the last set of graphs we wanted to clearly display on a state to state level the comparison of mask policy vs deaths, cases and vaccine rates. The results below show states before and after implementing the mask mandate, and states that never implemented it. Cases and deaths are higher across the board for stats that never implemented a mask mandate, while states that did saw drops between their unmasked numbers. Interestingly, in the fourth graphic below, 3 states stand out as having high vaccine rate, but no mask mandate. Tennessee, South Dakota and Idaho all have unusually high vaccine rates, suggesting either an issue with reporting, or perhaps these states implemented vaccine policies or rollouts to offset the lack of mask mandate.
#Total Weekly COVID-19 Cases
ggplot(df_summary, aes(x = State, y = Total_Weekly_CovidCases, fill = as.factor(Masks_Public))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("red", "blue"), labels = c("No Mask Mandate", "Mask Mandate")) +
labs(title = "Total Weekly COVID-19 Cases by State and Mask Policy",
x = "State",
y = "Total Weekly Cases",
fill = "Mask Policy") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Total Weekly Death Count
ggplot(df_summary, aes(x = State, y = Total_Weekly_DeathCount, fill = as.factor(Masks_Public))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("red", "blue"), labels = c("No Mask Mandate", "Mask Mandate")) +
labs(title = "Total Weekly Death Count by State and Mask Policy",
x = "State",
y = "Total Weekly Deaths",
fill = "Mask Policy") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Average Weekly Vaccination Rate
ggplot(df_summary, aes(x = State, y = Avg_Weekly_VaccinationRate, fill = as.factor(Masks_Public))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("red", "blue"), labels = c("No Mask Mandate", "Mask Mandate")) +
labs(title = "Average Weekly Vaccination Rate by State and Mask Policy",
x = "State",
y = "Average Weekly Vaccination Rate (%)",
fill = "Mask Policy") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Total COVID-19 Cases
ggplot(df_summary, aes(x = State, y = Total_CovidCases, fill = as.factor(Masks_Public))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("red", "blue"), labels = c("No Mask Mandate", "Mask Mandate")) +
labs(title = "Total COVID-19 Cases by State and Mask Policy",
x = "State",
y = "Total COVID-19 Cases",
fill = "Mask Policy") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Total Death Count
ggplot(df_summary, aes(x = State, y = Total_DeathCount, fill = as.factor(Masks_Public))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("red", "blue"), labels = c("No Mask Mandate", "Mask Mandate")) +
labs(title = "Total Death Count by State and Mask Policy",
x = "State",
y = "Total Deaths",
fill = "Mask Policy") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
After completing the EDA we wanted to check if reaching a certain threshold of vaccination was supportive of the hypothesis that vaccination was reducing the number of covid deaths. Since the CDC didn’t specifically reccomend a threshold that represented a reasonable level when we could expect vaccinations to slow deaths, we chose a 30% vaccination rate for the US. The reference US population for the COVID data we used was 328,239,523 so 30% was 98,471,857 which was suprassed on 11/3/2021. We applied the 20% threshold state buy state meaning that this threshold was reached on a variety of dates centered around the national threshold date.
# Apply threshold to create new binary indicator
df$death_case_ratio <- with(df, ifelse(CovidCases > 0, DeathCount / CovidCases, 0))
df$threshold_30 <- ifelse(df$Administered_Dose1_Pop_Pct >= 30, "after", "before")
# T-test for deaths
t_deaths <- t.test(DeathCount ~ threshold_30, data = df)
# T-test for cases
t_cases <- t.test(CovidCases ~ threshold_30, data = df)
# T-test for death/case ratio
t_ratio <- t.test(death_case_ratio ~ threshold_30, data = df)
For deaths we found that the p-Value was effectively 0 allowing us to reject the null hypothesis. Our results indicate that cases increased after the threshold was met. This also isn’t surprising as the pandemic was just getting started in most states and even a 30% vaccination rate still meant many were still vulnerable.
For cases we found that the p-Value was effectively 0 allowing us to reject the null hypothesis. Our results indicate that cases increased after the threshold was met. This also isn’t surprising for the same reason as above.
For ration of death/case we found that the p-Value was effectively 0 allowing us to reject the null hypothesis. This result indicates that the death/case ratio was lower after the threshold was met. This was the intent of the vaccination and suggests that it was effective. However, a single t-test isn’t sufficient evidence.
#Build regions as factors
df$Region <- ifelse(df$Region_MidWest == 1, "MidWest",
ifelse(df$Region_NorthEast == 1, "NorthEast",
ifelse(df$Region_South == 1, "South", "West")))
df$Region <- as.factor(df$Region)
We also wanted to test the hypothesis that the regions were typically similar. We start with a visual inspection of the regional data:
This plot indicate that while the relationship between cases and deaths is clearly positive, the slope of the positive relationship seems different between regions (see trend lines). To check this, we can run an ANOVA test to determine if the variation between groups was higher than the variation within the groups.
#one way anova
# Ho: The mean among the regions were the same
# Ha: The mean among the regions differed
# ANOVA for deaths across the regions
anova_deaths <- aov(DeathCount ~ Region, data = df)
# ANOVA for cases across the regions
anova_cases <- aov(CovidCases ~ Region, data = df)
# ANOVA for death/case ratio across regions
anova_ratio <- aov(death_case_ratio ~ Region, data = df)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Region | 3 | 7.846e+09 | 2.615e+09 | 40.91 | 2.083e-26 |
| Residuals | 82845 | 5.296e+12 | 63931916 | NA | NA |
For deaths we find that there is certainly a regional component with the p-value close to 0 indicating that the between region means are quite different.
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Region | 3 | 4.101e+13 | 1.367e+13 | 28.59 | 1.817e-18 |
| Residuals | 82845 | 3.961e+16 | 4.781e+11 | NA | NA |
Here too for casers deaths we find that there is certainly a regional component with the p-value close to 0 indicating that the between region means are quite different.
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Region | 3 | 0.02051 | 0.006836 | 8.574 | 1.092e-05 |
| Residuals | 82845 | 66.05 | 0.0007973 | NA | NA |
However, we a re most interested in how the ratio between deaths and cases differs between regions. For the death/case ratio we find that there are still differences between the regional means, p-value is less than 0.05 at 1.1e5, but larger than either of the two components of the ratio. This indicates that the relationship isn’t as strong and we might need to check the interactions with a Post-hoc test.
A Tukey post-hoc test allows us to determine which region interactions might be significant rather than just the overall significance of the ANOVA test.
# Post-hoc test for deaths
posthoc_deaths <- TukeyHSD(anova_deaths)
# Post-hoc test for cases
posthoc_cases <- TukeyHSD(anova_cases)
# Post-hoc test for death/case ratio
posthoc_ratio <- TukeyHSD(anova_ratio)
| Length | Class | Mode | |
|---|---|---|---|
| Region | 24 | -none- | numeric |
Again, as we are most interested in the death/case ratio we examine the Tukey test for that outcome. As we surmised earlier, the interactions between regions aren’t all significant. In fact, the differences between South-Northeast and West-Northeast are the only significant interactions (South-Midwest is just short of the 0.05 threshold). This is confirms what we saw in the graphical inspection as the slopes of Northeast was the steepest while Midwest was in the middle of the group.
We also might think that there may be a difference between years in the pandemic. While a true time series model is a bit out of our scope, including time dummies will serve a similar purpose. Since we believe that the impacts of COVID changed over time, think the variants of the virus appearing, we want to check if the death/case rate changed year over year. To do this, we run an ANOVA test.
# Extract year from date
df$Year <- format(df$Date, format="%Y")
To test the hypothesis that the years were typically similar. We start with a visual inspection of the regional data:
This plot indicate that while the relationship between cases and deaths is clearly positive, the slope of the positive relationship seems different between years (see trend lines). To check this, we can run an ANOVA test to determine if the variation between groups was higher than the variation within the groups.
#one way anova
# Ho: The mean among the years were the same
# Ha: The mean among the years differed
# ANOVA for death/case ratio across regions
anova_ratio_y <- aov(death_case_ratio ~ Year, data = df)
# Post-hoc test for death/case ratio
posthoc_ratio_y <- TukeyHSD(anova_ratio_y)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Year | 3 | 0.6121 | 0.204 | 258.2 | 8.249e-167 |
| Residuals | 82845 | 65.46 | 0.0007901 | NA | NA |
The result is that the p-value is close to 0 and thus we reject the null hypothesis and can confirm that the means among years are likely more different than the means within years. We again ran a Tukey post-hoc test to see where the relationships between years were the most different.
| Length | Class | Mode | |
|---|---|---|---|
| Year | 24 | -none- | numeric |
For most years, the mean of death/case is different. Only the 2022-2021 relationship isn’t significant, but it is quite close to our threshold. This indicates that we should also conder these variables in our regression.
Given our findings in hypotheses testing, it’s logical to move to a regression model to examine the relationships between death/case ratio and factors like policy, vaccination, and region. As our outcome variable, death/case, is continuous we will start by fitting an OLS regression and checking our assumptions. As our variable of interest, vaccination rate, is also continuous, we can model this relationship directly as our base (reference) model. However, we will want to rescale the values to help keep the interpretation simple. Thus we divide the percentage by 100.
# Rescale
df$Administered_Dose1_Pop_Pct <- df$Administered_Dose1_Pop_Pct/100
# Base model
base <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct, data = df)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.002298 | 0.000101 | 22.74 | 3.766e-114 |
| Administered_Dose1_Pop_Pct | 0.011 | 0.0005051 | 21.77 | 8.329e-105 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82849 | 0.02816 | 0.005689 | 0.005677 |
The relationship between death/case ratio and vaccination rate is significant and positive. Since we might hypothesize that this relationship should be negative based on other research, there’s clearly something driving death/case ratio beyond vaccinations. This is reflected in the low adjusted R^2 (explanation of variance) value 0.006. Thus it makes sense to begin adding other variables we have shown to be of importance.
Let’s start with regionality. We previous built dummies for regions so let’s re-run the regression with those added.
reg_1 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West, data = df)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.002947 | 0.0002842 | 10.37 | 3.614e-25 |
| Administered_Dose1_Pop_Pct | 0.01097 | 0.0005057 | 21.68 | 5.696e-104 |
| Region_NorthEast | -2.674e-06 | 0.0003542 | -0.007551 | 0.994 |
| Region_South | -0.001011 | 0.0003276 | -3.086 | 0.002032 |
| Region_West | -0.0009225 | 0.0003341 | -2.761 | 0.005757 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82849 | 0.02816 | 0.005952 | 0.005904 |
This had an marginal improvement on the adjusted R^2 value raising it to a still low 0.006. However, the model shows regionality is certainly correlated with the death/case ratio. I will note that the midwest region was removed from the regression to prevent autocorrelation. It is now the reference.
We can next add in some policy values to see if that improves the model. These are already dummies in our data so no special action is required to transform them.
reg_2 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public, data = df)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.02169 | 0.0006528 | 33.22 | 2.228e-240 |
| Administered_Dose1_Pop_Pct | 0.006073 | 0.0005279 | 11.5 | 1.326e-30 |
| Region_NorthEast | 0.0003549 | 0.0003524 | 1.007 | 0.3139 |
| Region_South | -0.001195 | 0.0003257 | -3.668 | 0.0002445 |
| Region_West | -0.0009919 | 0.0003325 | -2.983 | 0.002851 |
| Food_Drink | -0.01515 | 0.000717 | -21.13 | 7.354e-99 |
| Masks_Bus | -0.01922 | 0.000602 | -31.94 | 1.887e-222 |
| Masks_Public | -0.01943 | 0.0006062 | -32.05 | 5.391e-224 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82849 | 0.02797 | 0.01941 | 0.01933 |
These new variables are quite significant and have increased the adjusted R^2 value all the way to 0.019! Still, this value represents under 2% of the variance being explained by the model. We can thus still improve.
Our next option is to add years to our model. As year is continuous, we can first try the variable as a linear fit. Again, there is a scale issue. We thus subtract 2020 and divide by 4 (the number of years in the data).
# Scale Year
df$s_year <- (as.numeric(df$Year) - 2020)/4
# Regression
reg_3 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public + s_year, data = df)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.02671 | 0.0006787 | 39.35 | 0 |
| Administered_Dose1_Pop_Pct | 0.00996 | 0.0005469 | 18.21 | 5.705e-74 |
| Region_NorthEast | 0.0001673 | 0.0003511 | 0.4766 | 0.6336 |
| Region_South | -0.001175 | 0.0003244 | -3.621 | 0.0002937 |
| Region_West | -0.001072 | 0.0003312 | -3.237 | 0.001209 |
| Food_Drink | -0.01981 | 0.0007366 | -26.89 | 1.235e-158 |
| Masks_Bus | -0.02059 | 0.0006019 | -34.22 | 8.211e-255 |
| Masks_Public | -0.02064 | 0.0006056 | -34.08 | 7.363e-253 |
| s_year | -0.01323 | 0.0005123 | -25.82 | 1.867e-146 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82849 | 0.02785 | 0.02724 | 0.02714 |
Year is quite significant and increased the adjusted R^2 value all the way to 0.027. There is still room for improvement. We assumed that year was linearly related to death/case, but time relationships are rarely linear. Let’s try using time as a dummy.
# Scale Year
df$Year <- as.factor(df$Year)
# Regression
reg_4 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public + Year, data = df)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.04304 | 0.0009188 | 46.85 | 0 |
| Administered_Dose1_Pop_Pct | 0.01121 | 0.0005517 | 20.32 | 1.357e-91 |
| Region_NorthEast | 0.0001173 | 0.0003495 | 0.3356 | 0.7372 |
| Region_South | -0.001319 | 0.000323 | -4.085 | 4.421e-05 |
| Region_West | -0.001203 | 0.0003297 | -3.649 | 0.0002633 |
| Food_Drink | -0.0359 | 0.000961 | -37.36 | 6.698e-303 |
| Masks_Bus | -0.03693 | 0.0008921 | -41.39 | 0 |
| Masks_Public | -0.03698 | 0.0008964 | -41.26 | 0 |
| Year2021 | -0.004648 | 0.0002598 | -17.89 | 1.813e-71 |
| Year2022 | -0.005237 | 0.0002643 | -19.81 | 3.74e-87 |
| Year2023 | -0.03803 | 0.001154 | -32.95 | 1.356e-236 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82849 | 0.02773 | 0.03612 | 0.03601 |
Year clearly was non-linear as each of the years is significant and their inclusion increased the adjusted R^2 value all the way to 0.036. Year2020 is a reference in this model.
Still, the adjusted r^2 indicates we are only explaining 3.6% of the variance. Let’s check to see if there are observations that are hurting the model fit. To do this, we use cook’s distance.
Let’s start by plotting residuals.
par(mfrow = c(2, 2))
plot(reg_4)
We definitely have several data points over the .5 cook’s distance
threshold. This means they are high leverage and might be influencing
our fit. Let’s remove them.
# Cook's distances
cooks_distances <- cooks.distance(reg_4)
# make a threshold
threshold_99 <- quantile(cooks_distances, 0.99)
# identify influential rows
influential <- cooks_distances[(cooks_distances > threshold_99)]
# get row numbers
names_of_influential <- names(influential)
# seperate influential outliers
outliers <- df[names_of_influential,]
# Remove outliers from data
df_NO <- df %>% anti_join(outliers)
# rerun model
reg_5 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public + Year, data = df_NO)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.03202 | 0.0004175 | 76.68 | 0 |
| Administered_Dose1_Pop_Pct | 0.01264 | 0.0001511 | 83.67 | 0 |
| Region_NorthEast | 0.0007637 | 9.511e-05 | 8.029 | 9.935e-16 |
| Region_South | -0.0001947 | 8.794e-05 | -2.214 | 0.02686 |
| Region_West | -0.0004284 | 8.975e-05 | -4.773 | 1.819e-06 |
| Food_Drink | -0.0276 | 0.0004241 | -65.09 | 0 |
| Masks_Bus | -0.02729 | 0.0004133 | -66.03 | 0 |
| Masks_Public | -0.02736 | 0.000414 | -66.09 | 0 |
| Year2021 | -0.004003 | 7.053e-05 | -56.75 | 0 |
| Year2022 | -0.004572 | 7.184e-05 | -63.64 | 0 |
| Year2023 | -0.02861 | 0.0004607 | -62.09 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 82020 | 0.007508 | 0.1797 | 0.1796 |
Well, that helped. Now the adjusted R^2 is 0.18 and quite improved. Still, adjusted R^2 isn’t up to the 30% lower bound we prefer. Let’s see if removing no-information values can help. There’s a large number of 0 death/cases that might confuse the model. Let’s remove those and see if the fit improves.
#drop 0 information cases
df_NO <- subset(df_NO, death_case_ratio>0)
# rerun model
reg_6 <- lm(death_case_ratio ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public + Year, data = df_NO)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.03634 | 0.0007098 | 51.19 | 0 |
| Administered_Dose1_Pop_Pct | -0.007661 | 0.0006075 | -12.61 | 3.577e-36 |
| Region_NorthEast | 0.007143 | 0.0004402 | 16.23 | 1.947e-58 |
| Region_South | -0.003108 | 0.0004052 | -7.671 | 1.87e-14 |
| Region_West | -0.004629 | 0.0004124 | -11.22 | 4.682e-29 |
| Food_Drink | -0.0003496 | 0.0007594 | -0.4604 | 0.6453 |
| Masks_Bus | -0.001987 | 0.0006413 | -3.099 | 0.001948 |
| Masks_Public | -0.002553 | 0.0006518 | -3.917 | 9.022e-05 |
| Year2021 | -0.01505 | 0.0003854 | -39.05 | 4.526e-310 |
| Year2022 | -0.01641 | 0.0005588 | -29.38 | 8.32e-182 |
| Year2023 | -0.01983 | 0.0007557 | -26.24 | 1.013e-146 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 9744 | 0.01099 | 0.5098 | 0.5092 |
Well, that did it. The adjusted R^2 is now 0.509 and well over our threshold of 30%.
We need to next check the assumptions of an OLS regression. OLS
regression has the following assumptions:
1) Linearity 2) No endogeneity 3) Normality and homoscedasticity 4) No
autocorrelation 5) no multicollinearity
We start by plotting residuals.
library(broom)
model.diag.metrics <- augment(reg_6)
ggplot(model.diag.metrics, aes(Administered_Dose1_Pop_Pct, death_case_ratio)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = Administered_Dose1_Pop_Pct, yend = .fitted), color = "red", size = 0.3)
You’ll notice that the data is split into two clusters with a small number of points below the 30% vaccination threshold we checked before fitting poorly to the trend line. This isn’t unexpected, but we should keep this in mind as we check the assumptions.
Next, let’s run diagnostic plots.
par(mfrow = c(2, 2))
plot(reg_6)
Residuals vs. Fitted: This is used to check for linearity in the model.
We are looking for a fairly straight line here but the plot is curved.
Normal Q-Q: This is used to check to see if the residuals are normally
distributed. Since the plot deviates from the dashed line, we probably
don’t have normal distribution of the residuals. Sacle-location: This is
used to check for homogeneity of variance of the residuals
(homoscedasticity). The line is definitely not horizontal which
indicates there might be heteroskedasticity. Residuals vs. leverage:
There are still high leverage points in the data.
It certainly seems like we have some violations of the OLS assumptions. Fortunately, we can transform the Y and hopefully fix the issue. Let’s first plot the death/case variable to see what’s going on.
The data is certainly skewed right. With skewed data the prescribed method to fix this is to log transform. This process is a linear transformation so shouldn’t impact the model. Let’s see what the distribution looks like after doing so.
That’s much better. Let’s try re-running the model and the diagnostics.
reg_7 <- lm(log(death_case_ratio) ~ Administered_Dose1_Pop_Pct + Region_NorthEast + Region_South + Region_West + Food_Drink + Masks_Bus + Masks_Public + Year, data = df_NO)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -3.399 | 0.02662 | -127.7 | 0 |
| Administered_Dose1_Pop_Pct | -0.2953 | 0.02279 | -12.96 | 4.277e-38 |
| Region_NorthEast | 0.1641 | 0.01651 | 9.938 | 3.659e-23 |
| Region_South | -0.1144 | 0.0152 | -7.527 | 5.661e-14 |
| Region_West | -0.2471 | 0.01547 | -15.98 | 9.488e-57 |
| Food_Drink | -0.02424 | 0.02848 | -0.851 | 0.3948 |
| Masks_Bus | -0.1055 | 0.02405 | -4.385 | 1.173e-05 |
| Masks_Public | -0.1226 | 0.02445 | -5.016 | 5.381e-07 |
| Year2021 | -0.4881 | 0.01446 | -33.77 | 1.786e-236 |
| Year2022 | -0.7436 | 0.02096 | -35.48 | 1.98e-259 |
| Year2023 | -0.9115 | 0.02835 | -32.16 | 1.094e-215 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 9744 | 0.4121 | 0.5722 | 0.5718 |
The result is that our adjusted R^2 creeps up to 0.572 but the significant vars don’t change. Let’s check the residuals.
par(mfrow = c(2, 2))
plot(reg_7)
Much better. Residuals vs. Fitted: line is straighter Normal Q-Q: The
plot deviates from the dashed line but far less than before.
Sacle-location: The line is still not horizontal which indicates there
might be heteroskedasticity. Residuals vs. leverage: There are still
high leverage points in the data but are now all under 4 standard
deviations.
It certainly seems like we still have some violations of the OLS assumptions. There are further steps we could take to deal with this (interaction or polynomial terms, non-parametric, splines, etc.) but the goal here isn’t to make a perfect model. Rather it’s to check the relationships between variables and the death/case ratio.
Using the log transformed death/case model (see below) we can extract some insights from the model.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -3.399 | 0.02662 | -127.7 | 0 |
| Administered_Dose1_Pop_Pct | -0.2953 | 0.02279 | -12.96 | 4.277e-38 |
| Region_NorthEast | 0.1641 | 0.01651 | 9.938 | 3.659e-23 |
| Region_South | -0.1144 | 0.0152 | -7.527 | 5.661e-14 |
| Region_West | -0.2471 | 0.01547 | -15.98 | 9.488e-57 |
| Food_Drink | -0.02424 | 0.02848 | -0.851 | 0.3948 |
| Masks_Bus | -0.1055 | 0.02405 | -4.385 | 1.173e-05 |
| Masks_Public | -0.1226 | 0.02445 | -5.016 | 5.381e-07 |
| Year2021 | -0.4881 | 0.01446 | -33.77 | 1.786e-236 |
| Year2022 | -0.7436 | 0.02096 | -35.48 | 1.98e-259 |
| Year2023 | -0.9115 | 0.02835 | -32.16 | 1.094e-215 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 9744 | 0.4121 | 0.5722 | 0.5718 |
First and foremost, the vaccination rathe certainly reduced the death/case rate as the coefficient in our model is negative and significant. Similarly, imposing mask mandates (both in public and in businesses) also reduced the death/case rate. Conversely, living in the NorthEast part of the US meant a higher death/case rate compared to all other regions. Finally, the death/case rate dropped over time indicating the possibility of a survivor bias or a building immunity beyond the vaccination rate.
As we conducted this analysis we found several additional factors which might have been predictive but we lacked data on. There were several variants of the COVID-19 virus with different virulence levels that dominated the infections. We were not able to find good data on that confound in a state by state level. Further, multiple vaccinations were available with different efficacy rates and individuals took different combinations of vaccinations. While we did have some of that data, it was extremely difficult for us, non-microbiologist, non-epidemiologists to interpret the information and we thus decided to exclude it. Finally, there is little to no data on the re-infection rate. As many people developed immunity from infection, this is a big piece of missing data in our model.
In future, we believe that a better way to determine the relationship between vaccination and death/case would be to use individual level data rather than state counts. This model then becomes far more predictive as we see outcomes on a case by case basis rather than in total. With the data we used, a large amount of the variation between states is lost and thus we can not be confident enough in our model to suggest policy changes from our results. Rather, we suggest following the CDC recommendations to prevent disease transfer: Cover your mouth, wash your hands, stay home when you’re sick.